Inn the Neighborhood is an online platform that allows people to rent out their properties for short stays. At the moment, only 2% of visitors who come to the site interested in renting out their homes actually start using it.
The product manager would like to increase this. They want to develop an application that helps people estimate how much they could earn by renting out their living space, in the hope that this will make them more likely to sign up.
The product manager would like to know how accurately we can predict the price of a rental. In particular, they want to avoid estimates that are more than \$25 off the actual price, as this may discourage people.
The data you will use for this analysis can be accessed here: "data/rentals.csv"
(8111, 9)
| | id | latitude | longitude | property_type | room_type | bathrooms | bedrooms | minimum_nights | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 958 | 37.76931 | -122.43386 | Apartment | Entire home/apt | 1.0 | 1.0 | 1 | $170.00 |
| 1 | 3850 | 37.75402 | -122.45805 | House | Private room | 1.0 | 1.0 | 1 | $99.00 |
| 2 | 5858 | 37.74511 | -122.42102 | Apartment | Entire home/apt | 1.0 | 2.0 | 30 | $235.00 |
| 3 | 7918 | 37.76669 | -122.45250 | Apartment | Private room | 4.0 | 1.0 | 32 | $65.00 |
| 4 | 8142 | 37.76487 | -122.45183 | Apartment | Private room | 4.0 | 1.0 | 32 | $65.00 |
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8111 entries, 0 to 8110
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              8111 non-null   int64
 1   latitude        8111 non-null   float64
 2   longitude       8111 non-null   float64
 3   property_type   8111 non-null   object
 4   room_type       8111 non-null   object
 5   bathrooms       8099 non-null   float64
 6   bedrooms        8107 non-null   float64
 7   minimum_nights  8111 non-null   int64
 8   price           8111 non-null   object
dtypes: float64(4), int64(2), object(3)
memory usage: 570.4+ KB
```
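The `info()` output shows `price` stored as `object` because of the "$" formatting. A minimal, self-contained sketch of the loading and cleaning step (an inline two-row sample stands in for `data/rentals.csv`, and the exact cleaning code in the notebook may differ):

```python
import io
import pandas as pd

# Inline sample standing in for "data/rentals.csv" so the sketch runs on its own.
sample = io.StringIO(
    "id,latitude,longitude,property_type,room_type,bathrooms,bedrooms,minimum_nights,price\n"
    "958,37.76931,-122.43386,Apartment,Entire home/apt,1.0,1.0,1,$170.00\n"
    "3850,37.75402,-122.45805,House,Private room,1.0,1.0,1,$99.00\n"
)
rentals = pd.read_csv(sample)
# Strip the "$" and thousands separators, then cast to float.
rentals["price"] = (rentals["price"]
                    .str.replace("$", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype(float))
print(rentals["price"].tolist())  # [170.0, 99.0]
```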
| | id | latitude | longitude | property_type | room_type | bathrooms | bedrooms | minimum_nights | price |
|---|---|---|---|---|---|---|---|---|---|
| 139 | 144978 | 37.79336 | -122.42506 | Apartment | Private room | NaN | 1.0 | 30 | $56.00 |
| 181 | 229240 | 37.79341 | -122.40340 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 196 | 259621 | 37.79470 | -122.40374 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 197 | 259622 | 37.79441 | -122.40473 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 267 | 430692 | 37.75906 | -122.40761 | Apartment | Private room | NaN | 1.0 | 31 | $68.00 |
| 352 | 596042 | 37.79384 | -122.42436 | Apartment | Private room | NaN | 1.0 | 30 | $63.00 |
| 434 | 785901 | 37.79313 | -122.40443 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 435 | 786492 | 37.79421 | -122.40310 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 436 | 786506 | 37.79260 | -122.40339 | Hostel | Shared room | NaN | 1.0 | 1 | $45.00 |
| 539 | 1031899 | 37.74833 | -122.42621 | Apartment | Entire home/apt | NaN | 1.0 | 30 | $127.00 |
| 606 | 1206233 | 37.77028 | -122.44757 | Apartment | Private room | NaN | 1.0 | 1 | $79.00 |
| 7036 | 34902361 | 37.77790 | -122.43688 | Apartment | Private room | NaN | 1.0 | 30 | $50.00 |
array([ 1. , 4. , 1.5, 2. , 3. , 0. , 2.5, 3.5, nan, 0.5, 6.5,
10. , 4.5, 14. , 8. , 5. , 6. , 7. ])
| | id | latitude | longitude | property_type | room_type | bathrooms | bedrooms | minimum_nights | price |
|---|---|---|---|---|---|---|---|---|---|
| 269 | 431862 | 37.78321 | -122.41969 | Apartment | Entire home/apt | 1.0 | NaN | 30 | $124.00 |
| 6301 | 32183178 | 37.78883 | -122.48640 | House | Entire home/apt | 3.5 | NaN | 30 | $650.00 |
| 7786 | 38329898 | 37.78347 | -122.41669 | Apartment | Entire home/apt | 1.0 | NaN | 30 | $75.00 |
| 7840 | 38550933 | 37.78979 | -122.41994 | Apartment | Entire home/apt | 1.0 | NaN | 30 | $108.00 |
array([ 1., 2., 0., 3., 4., nan, 5., 6., 14., 7., 8.])
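The listings above show `NaN` in both `bathrooms` (12 rows) and `bedrooms` (4 rows). One option, sketched below on toy data, is a median fill; dropping the handful of affected rows would be an equally defensible choice, and which route the notebook took is not shown here.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data.
df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "bedrooms":  [1.0, 1.0, np.nan, 2.0],
})
# Fill each column's missing values with that column's median.
for col in ["bathrooms", "bedrooms"]:
    df[col] = df[col].fillna(df[col].median())
print(df["bathrooms"].tolist())  # [1.0, 1.0, 2.0, 1.0]
```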
array(['Apartment', 'House', 'Condominium', 'Townhouse', 'Loft',
'Guest suite', 'Cottage', 'Hostel', 'Guesthouse',
'Serviced apartment', 'Bungalow', 'Bed and breakfast', 'Hotel',
'Boutique hotel', 'Other', 'Tiny house', 'Resort', 'Villa',
'Aparthotel', 'Castle', 'Camper/RV', 'In-law', 'Earth house',
'Cabin', 'Dome house', 'Hut'], dtype=object)
array(['Entire home/apt', 'Private room', 'Shared room', 'Hotel room'],
dtype=object)
The dataset has 8111 rows and 9 columns, with 16 missing values in total (12 in `bathrooms` and 4 in `bedrooms`).
array([ 1, 30, 32, 6, 3, 90,
2, 5, 4, 60, 10, 365,
80, 45, 7, 29, 31, 9,
14, 183, 200, 180, 120, 58,
360, 50, 59, 70, 16, 75,
110, 13, 55, 140, 28, 1125,
85, 21, 18, 100000000, 11, 188,
1000, 40, 65, 100, 12, 38,
8, 25, 15, 150, 33], dtype=int64)
4528
```
count     4528.000000
mean       263.161661
std        479.444582
min          0.000000
25%        105.000000
50%        169.000000
75%        280.000000
max      10000.000000
Name: price, dtype: float64
```
I have investigated the target variable, the property features, and the relationship between them. After the analysis, I decided to apply the following changes to enable modeling:
[Figure: Normalized Rental Price Boxplot]
[Figure: normalized price distribution; y-axis ticks labelled $10, $100, $1000, $10000 at positions 1.0–4.0]
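The dollar tick labels ($10 at 1.0 up to $10000 at 4.0) indicate the normalized price is on a log10 scale. A sketch of that transform, assuming `price_normed = log10(price)` with zero-price rows handled separately, since log10(0) is undefined:

```python
import numpy as np
import pandas as pd

# log10 transform of the dollar price: $10 -> 1.0, $100 -> 2.0, $1000 -> 3.0.
price = pd.Series([10.0, 100.0, 1000.0, 170.0])
price_normed = np.log10(price)
print([round(v, 6) for v in price_normed.tolist()])
```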
| room_type | Entire home/apt | Hotel room | Private room | Shared room |
|---|---|---|---|---|
| property_type | ||||
| Aparthotel | 3 | 2 | 17 | 0 |
| Apartment | 667 | 0 | 424 | 4 |
| Bed and breakfast | 0 | 12 | 17 | 14 |
| Boutique hotel | 15 | 60 | 189 | 0 |
| Bungalow | 8 | 0 | 2 | 1 |
| Cabin | 1 | 0 | 1 | 0 |
| Castle | 0 | 0 | 4 | 0 |
| Condominium | 341 | 0 | 193 | 1 |
| Cottage | 7 | 0 | 2 | 0 |
| Dome house | 1 | 0 | 0 | 0 |
| Earth house | 0 | 0 | 1 | 0 |
| Guest suite | 421 | 0 | 64 | 0 |
| Guesthouse | 19 | 0 | 1 | 0 |
| Hostel | 0 | 35 | 9 | 23 |
| Hotel | 2 | 35 | 110 | 0 |
| House | 664 | 0 | 896 | 7 |
| In-law | 1 | 0 | 0 | 0 |
| Loft | 34 | 0 | 11 | 1 |
| Other | 1 | 0 | 17 | 0 |
| Resort | 0 | 2 | 11 | 0 |
| Serviced apartment | 49 | 18 | 4 | 0 |
| Tiny house | 2 | 0 | 0 | 0 |
| Townhouse | 41 | 0 | 56 | 0 |
| Villa | 1 | 0 | 5 | 0 |
Some property types have almost no samples, making it harder for the regression model to adjust for property type.
I analyzed how the lack of samples affected the predictions and created another column (adj_property_type) with fewer types.
Each category with fewer than 100 samples was mapped to the category with the closest bedrooms-vs-price slope (lowest MSE) among nearby properties.
```
House                 1567
Apartment             1095
Condominium            535
Guest suite            485
Boutique hotel         264
Hotel                  147
Townhouse               97
Serviced apartment      71
Hostel                  67
Loft                    46
Bed and breakfast       43
Aparthotel              22
Guesthouse              20
Other                   18
Resort                  13
Bungalow                11
Cottage                  9
Villa                    6
Castle                   4
Cabin                    2
Tiny house               2
Earth house              1
Dome house               1
In-law                   1
Name: property_type, dtype: int64
```
We will eliminate any property type with fewer than 100 samples, merging it into the type whose linear regression of bedrooms vs. price_normed gives the best fit (lowest MSE). To compare only properties in the same region, we restrict the check to a margin around the tested properties.
- Serviced apartment → Apartment
- Hostel → House
- Loft → Condominium
- Bed and breakfast → House
- Aparthotel → Townhouse
- Guesthouse → Condominium
- Other → Apartment
- Bungalow → House
- Resort → Hotel
- Cottage → Hotel
- Villa → Condominium
- Castle → House
- Cabin → Townhouse
- Tiny house → Condominium
- Earth house → House
- Dome house → Townhouse
- In-law → Boutique hotel
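Applying the remapping listed above can be sketched as a dictionary replace; only a few of the pairs are shown, the rest follow the same pattern:

```python
import pandas as pd

# Partial remapping of rare property types (pairs taken from the list above).
rare_map = {
    "Serviced apartment": "Apartment",
    "Hostel": "House",
    "Loft": "Condominium",
    "Bed and breakfast": "House",
    "Resort": "Hotel",
}
property_type = pd.Series(["Loft", "Apartment", "Hostel"])
# Values not in the map (the common categories) are kept unchanged.
adj_property_type = property_type.replace(rare_map)
print(adj_property_type.tolist())  # ['Condominium', 'Apartment', 'House']
```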
The most common room_type is "Entire home/apt".
array([[ 44.85661668],
[ 65.91886454],
[102.1101941 ],
[169.54019535],
[309.75507112],
[651.19599368]])
I checked the influence of location on price: the median price of a single room near the property has a strong influence on the final price.
To map the regions, we divide the map into cells of 0.04 degrees latitude and longitude and take the median price near each point.
For a visual understanding of the region mapping, we show the single-bedroom median for our dataset as an example.
Because this mapping uses the whole dataset, it has to be re-created after the train/test split.
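The 0.04-degree grid described above can be sketched as follows on a four-row toy frame; the notebook's exact implementation is not shown, but flooring each coordinate to its cell edge and taking the per-cell median is one straightforward way to build the lookup:

```python
import pandas as pd

# Toy frame; the real lookup would be built from the training split only.
df = pd.DataFrame({
    "latitude":  [37.761, 37.762, 37.801, 37.802],
    "longitude": [-122.431, -122.432, -122.401, -122.402],
    "price":     [170.0, 99.0, 235.0, 65.0],
})
# Floor each coordinate down to its 0.04-degree cell edge.
df["lat_bin"] = (df["latitude"] // 0.04) * 0.04
df["lon_bin"] = (df["longitude"] // 0.04) * 0.04
# Median price per cell becomes the regional price feature.
region_median = df.groupby(["lat_bin", "lon_bin"])["price"].median()
print(sorted(region_median.tolist()))  # [134.5, 150.0]
```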
To enable modelling, we chose id, latitude, longitude, adj_property_type, room_type, bathrooms, bedrooms, and minimum_nights as features and price_normed as the target variable. I also made the following changes:
Predicting the price is a regression problem in machine learning. I chose the Linear Regression model because we can see a moderate-to-strong relationship between some features and the target variable. As a comparison I chose the Gradient Boosting Regressor model because it captures non-linear relationships and feature interactions and generally provides better accuracy.
For evaluation, I chose R squared and RMSE (Root Mean Squared Error). R squared measures the proportion of the variance in the target variable that the model explains from the features; RMSE measures how far, on average, the predicted values deviate from the actual values.
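These metrics, plus the within-\$25 share used later as the KPI, can be computed as below. This is a from-scratch sketch on toy numbers; the notebook presumably uses the `sklearn.metrics` equivalents, and in practice log-scale predictions would be back-transformed to dollars before the \$25 check.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (R^2, RMSE, share of predictions within $25)."""
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    within_25 = float(np.mean(np.abs(resid) <= 25.0))
    return r2, rmse, within_25

y_true = np.array([100.0, 150.0, 200.0, 400.0])
y_pred = np.array([110.0, 140.0, 260.0, 395.0])
r2, rmse, within = evaluate(y_true, y_pred)
# 3 of the 4 residuals are within $25, so the last value is 0.75.
print(round(r2, 3), round(rmse, 2), within)
```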
```
Checking property type 5
Checking property type 3
Checking property type 1
Checking property type 2
Checking property type 0
Checking property type 4
Checking property type 6
Done checking property types
Checking room type 0
Checking room type 2
Checking room type 1
Checking room type 3
Done checking room types
```
LinearRegression()
```
Linear Regression r2_score: 0.56
Linear Regression Root Mean Squared Error: 0.65
Linear Regression predictions:
Within $25: 494
Total: 1359
Percentage: 36.4%
```
array([[ 0.76120252],
[-0.41541559],
[-0.55874855],
...,
[-0.4656158 ],
[-0.18177027],
[ 1.47861684]])
```
Best max_depth is: 5
Testing R^2 is: 0.6136158440651533
```
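The "Best max_depth is: 5" result suggests a simple search over candidate depths. A hedged sketch on synthetic data (the real search used the training set and 300 trees; the candidate list and data here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data with a non-linear term.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Fit one model per candidate depth; keep the depth with the best held-out R^2.
scores = {}
for depth in [2, 3, 5, 7]:
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=50,
                                      random_state=42)
    scores[depth] = model.fit(X_tr, y_tr).score(X_te, y_te)
best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```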
GradientBoostingRegressor(max_depth=5, n_estimators=300, random_state=42)
```
Gradient Boosting r2_score: 0.61
Gradient Boosting Root Mean Squared Error: 0.61
Gradient Boosting predictions:
Within $25: 534
Total: 1359
Percentage: 39.3%
```
[Figure: Gradient Boost Regressor Feature Importance]
The R squared of the Linear Regression and the Gradient Boosting Regressor models is 0.56 and 0.61 respectively, meaning the Gradient Boosting Regressor fits the data better.
The RMSE of the Linear Regression and the Gradient Boosting Regressor models is 0.65 and 0.61 respectively, meaning the Gradient Boosting Regressor has less error in predicting values.
The company wants to avoid predictions more than \$25 off the actual price. Therefore, we use the percentage of predictions within \$25 of the actual price as a KPI to compare the two models; the higher the percentage, the better the model performs. 39.3% of the Gradient Boosting Regressor's predictions are within \$25 of the actual rent price, while the Linear Regression model achieves only 36.4%.
To help users estimate a price, we can deploy this GradientBoostingRegressor model into production. With this model, about 39% of the estimates shown to users will be within \$25 of the actual price. To evaluate whether the estimator really encourages more people to rent out their property, I would also recommend an A/B test comparing users who see the estimate against users who do not.
To implement and improve the model, I will consider the following steps:
The ideal way to deploy this model is as a web service on the home page of the website, so we can measure its effect on the number of new users renting out properties on the platform. The calculator should lead to the registration page, auto-completing the fields already filled in to make sign-up as frictionless as possible.
- Collect more data, e.g. area, property age, electronics, internet access, garage, demand, and seasonality, all of which can have a big impact on short-stay rental prices. Some categories had just a single sample; ideally, having several examples for each room_type × property_type combination would improve the predictions.
- Feature engineering, e.g. reduce the number of categories in the model, or build several region mappings, one per category, to obtain a "region classification".
- Improve the calculator by showing a price range based on the regional variance instead of a single prediction, so users can see the spread of prices in their region and apply their own knowledge of intangibles such as brand, design, and maintenance when setting the price.
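The price-range idea could be as simple as reporting regional quartiles around the point estimate, sketched here with made-up regional prices:

```python
import pandas as pd

# Hypothetical prices of comparable listings in the user's region.
region_prices = pd.Series([80.0, 95.0, 110.0, 130.0, 150.0, 210.0])
# Interquartile range as the suggested band around the point prediction.
low, high = region_prices.quantile([0.25, 0.75])
print(f"Suggested range: ${low:.0f}-${high:.0f}")  # Suggested range: $99-$145
```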
| room_type | Entire home/apt | Hotel room | Private room | Shared room |
|---|---|---|---|---|
| adj_property_type | ||||
| Apartment | 214.0 | 5.0 | 142.0 | 1.0 |
| Boutique hotel | 3.0 | 13.0 | 69.0 | NaN |
| Condominium | 110.0 | NaN | 80.0 | 2.0 |
| Guest suite | 117.0 | NaN | 20.0 | NaN |
| Hotel | 3.0 | 17.0 | 38.0 | NaN |
| House | 181.0 | 4.0 | 283.0 | 14.0 |
| Townhouse | 12.0 | NaN | 31.0 | NaN |
```
Generation 1 - Current best internal CV score: 0.8003962038422999
Generation 2 - Current best internal CV score: 0.801019490879051
Generation 3 - Current best internal CV score: 0.801019490879051
Generation 4 - Current best internal CV score: 0.801019490879051
Generation 5 - Current best internal CV score: 0.8026510607366657
Generation 6 - Current best internal CV score: 0.8027330906678115
Generation 7 - Current best internal CV score: 0.8027330906678115
Generation 8 - Current best internal CV score: 0.8027330906678115
Generation 9 - Current best internal CV score: 0.8027330906678115
Generation 10 - Current best internal CV score: 0.8027330906678115
Best pipeline: RandomForestRegressor(KNeighborsRegressor(LinearSVR(input_matrix, C=1.0, dual=True, epsilon=0.001, loss=squared_epsilon_insensitive, tol=0.01), n_neighbors=18, p=2, weights=uniform), bootstrap=False, max_features=0.3, min_samples_leaf=3, min_samples_split=15, n_estimators=100)
```
TPOTRegressor(generations=10, offspring_size=20, population_size=10, scoring='r2', verbosity=2)

```
Linear Regression r2_score: 0.62
Linear Regression Root Mean Squared Error: 0.6
Linear Regression predictions:
Within $25: 533
Total: 1359
Percentage: 39.2%
```
100%|██████████| 100/100 [05:14<00:00, 3.15s/trial, best loss: 0.58719646799117]
{'max_depth': 10.0, 'min_samples_split': 43, 'n_estimators': 330.0}
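Note that the hyperopt result above returns floats for `max_depth` and `n_estimators`; recent scikit-learn versions reject float values for these integer parameters, so casting before constructing the model avoids an error:

```python
from sklearn.ensemble import RandomForestRegressor

# Best parameters as returned by hyperopt (floats for integer parameters).
best = {"max_depth": 10.0, "min_samples_split": 43, "n_estimators": 330.0}
model = RandomForestRegressor(
    max_depth=int(best["max_depth"]),
    min_samples_split=int(best["min_samples_split"]),
    n_estimators=int(best["n_estimators"]),
    random_state=42,
)
print(model.get_params()["max_depth"])  # 10
```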
```
Linear Regression r2_score: 0.58
Linear Regression Root Mean Squared Error: 0.63
Linear Regression predictions:
Within $25: 548
Total: 1359
Percentage: 40.3%
```